ROBUSTNESS OF REGRESSION ESTIMATES FOR
ORDERED DICHOTOMOUS VARIABLES 



ABSTRACT 

The purpose of this paper is to examine the validity of regression 
estimates when skewed dichotomous scales are used as independent 
variables. When Pearson product-moment correlations are used to 
measure zero-order associations involving dichotomous variables, the 
resulting coefficients underestimate the true associations. As a result,
using product-moment correlations involving dichotomous variables in 
regression equations apparently yields biased partial regression 
estimates. The analysis reported here was based on fifty sets of 
simulated data with 300 cases and four independent variables in each set. 
We found that tetrachoric and polyserial correlations for associations 
involving dichotomized variables yield more accurate estimates of the 
regression coefficients based on the underlying continuous data than did
product-moment correlations for the same dichotomous variables. 



ROBUSTNESS OF REGRESSION ESTIMATES FOR 
ORDERED DICHOTOMOUS VARIABLES 

One of the most frequently used measures of association is the 
product-moment correlation coefficient. Not only are correlation 
coefficients used by themselves as bivariate measures of association, they 
are frequently used in matrix form as input into multivariate analyses 
such as factor analysis or multiple regression. While no assumptions are 
required for the computation of a product-moment correlation, the 
interpretation of the resulting coefficient certainly depends on whether 
or not the data fit an appropriate statistical model (Carroll, 1961). In 
particular, the product-moment correlation measures a linear relationship 
between two continuous variables. When a product-moment correlation is 
used to estimate the degree of association between two categorical 
variables with underlying continuities, the possible range of values for a 
product-moment correlation may not be from -1.0 to +1.0, but is 
dependent on the marginal distributions. The true association between 
two categorical variables will be underestimated with a product-moment
correlation (Carroll, 1961; Ferguson, 1941; Muthen, 1983a, 1983b;
Pearson, 1900, 1904, 1913). 

Muthen (1983a) vividly illustrated the underestimation of the 
association between two categorical variables when estimated with 
product-moment correlations. Muthen began his illustration with two 
continuous variables whose true correlation was .50. He then
categorized the same data into two, three, four, and five categories with 
varying degrees of skewness. Such variables are those one would 
encounter, for example, if one used an agree-disagree questionnaire
format or a Likert scale to measure some underlying continuous
attitudinal scale. For two dichotomous variables with zero skewness
(i.e., a 50/50 split) Muthen showed that the product-moment correlation
was .33, and decreased to a mere .10 when a variable split 90/10 was
correlated with a variable split 10/90. Thus, the degrees of association
among all categorical variables are underestimated with product-moment
correlations, but the greatest underestimation occurs with highly skewed
dichotomous variables.
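
The attenuation described above can be checked by simulation. The following is a minimal sketch, not Muthen's own procedure: it assumes bivariate normal data with a true correlation of .50, and the function name, sample size, and seed are illustrative choices.

```python
import numpy as np
from statistics import NormalDist


def dichotomized_correlation(rho, share_x, share_y, n=400_000, seed=0):
    """Pearson r between two dichotomies cut from a bivariate normal.

    share_x and share_y are the proportions of cases coded 0 on each
    variable (an illustrative convention, not taken from the paper).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    # Thresholds on the latent scale that give the requested splits.
    cut_x = NormalDist().inv_cdf(share_x)
    cut_y = NormalDist().inv_cdf(share_y)
    dx = (x > cut_x).astype(float)
    dy = (y > cut_y).astype(float)
    return np.corrcoef(dx, dy)[0, 1]


even = dichotomized_correlation(0.50, 0.50, 0.50)    # 50/50 with 50/50
skewed = dichotomized_correlation(0.50, 0.90, 0.10)  # 90/10 with 10/90
print(f"50/50 with 50/50: r = {even:.3f}")
print(f"90/10 with 10/90: r = {skewed:.3f}")
```

With a large simulated sample the two values should fall close to the .33 and .10 reported by Muthen, although they vary slightly with the seed.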

Associations among categorical variables are more appropriately
measured by tetrachoric or polychoric correlations (Carroll, 1961; Muthen,
1983a, 1983b; Pearson, 1913; Pearson and Pearson, 1922). Such
correlations are estimates of population correlations among latent,
continuous response variables (Brown and Benedetti, 1977; Muthen,
1983b), and are calculated with reference to threshold values estimated
from the marginal distributions. The use of these correlations may be
thought of as robustifying the correlations against categorization and
skewness, or "stretching" the correlations to assume values between -1.0
and +1.0 (Muthen, 1983b).
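
To make the role of the thresholds concrete, the sketch below estimates a tetrachoric correlation from a 2x2 table. It is a simplified illustration, not the LISREL algorithm: thresholds are taken from the marginal proportions, and the latent correlation is then found by bisection so that the bivariate normal reproduces the observed joint proportion. The function names and the example counts are hypothetical.

```python
import math
import numpy as np
from statistics import NormalDist

_erf = np.vectorize(math.erf)


def _upper_quadrant_prob(h, k, rho, n_grid=4001):
    """P(X > h, Y > k) for a standard bivariate normal with correlation rho."""
    x = np.linspace(h, h + 9.0, n_grid)
    density = np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)
    # P(Y > k | X = x), since Y given x is N(rho * x, 1 - rho^2).
    z = (rho * x - k) / math.sqrt(1.0 - rho**2)
    cond = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    f = density * cond
    # Trapezoidal rule over the grid.
    return float(np.sum((f[1:] + f[:-1]) * np.diff(x)) / 2.0)


def tetrachoric(table):
    """Tetrachoric r from counts [[n00, n01], [n10, n11]] (rows index X)."""
    t = np.asarray(table, dtype=float)
    total = t.sum()
    p_x1 = t[1].sum() / total       # share coded 1 on X
    p_y1 = t[:, 1].sum() / total    # share coded 1 on Y
    p_11 = t[1, 1] / total          # share coded 1 on both
    h = NormalDist().inv_cdf(1.0 - p_x1)  # threshold on latent X
    k = NormalDist().inv_cdf(1.0 - p_y1)  # threshold on latent Y
    lo, hi = -0.999, 0.999
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        # The quadrant probability increases with rho.
        if _upper_quadrant_prob(h, k, mid) < p_11:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


table = [[2, 1], [1, 2]]  # hypothetical counts: 50/50 margins, phi = 1/3
r_tet = tetrachoric(table)
print(round(r_tet, 3))
```

For this table the product-moment (phi) coefficient is 1/3, while the tetrachoric estimate comes out at approximately .50, which is the "stretching" described above.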

A related instance of an underestimation of association occurs when 
product-moment correlations are used to measure the degree of 
association between continuous and ordered categorical variables. Such 
associations are more appropriately measured with polyserial correlations
(Jaspen, 1946; Olsson, Drasgow, and Dorans, 1982; Pearson, 1913). As 
before, the degree of underestimation is greatest when one of the 
variables is a dichotomy and both variables are highly skewed. 
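
For a single dichotomy, the simplest member of the polyserial family is the classical biserial correlation, which rescales the product-moment (point-biserial) coefficient by a factor that depends only on the split. The sketch below assumes a normal latent variable and simulated data; it uses the classical correction rather than the maximum likelihood estimator of Olsson, Drasgow, and Dorans, and all names are illustrative.

```python
import numpy as np
from statistics import NormalDist


def attenuation_factor(p):
    """Ratio of point-biserial to biserial r when a normal variable is cut
    so that a proportion p of cases falls below the threshold."""
    nd = NormalDist()
    z = nd.inv_cdf(p)
    return nd.pdf(z) / np.sqrt(p * (1.0 - p))


rng = np.random.default_rng(1)
n, rho = 200_000, 0.50
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)

# Dichotomize x with a 70/30 split; y stays continuous.
x_dich = (x > NormalDist().inv_cdf(0.70)).astype(float)
p_below = 1.0 - x_dich.mean()

r_pb = np.corrcoef(x_dich, y)[0, 1]          # product-moment coefficient
r_bis = r_pb / attenuation_factor(p_below)   # biserial correction
print(f"product-moment: {r_pb:.3f}   biserial: {r_bis:.3f}")
```

With a 70/30 split the factor is about .76, so a latent correlation of .50 shows up as a product-moment coefficient of roughly .38, and the correction brings it back to roughly .50.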

For a long time it has been known that product-moment correlations
underestimate associations among ordered categorical data. When
product-moment correlations are used to measure associations among such
data, and then used in matrix form as input into multivariate analyses
such as factor analysis or multiple regression, the consequences are less
clear. Muthen (1983a) and Olsson (1979) have demonstrated in factor
analytic models that the use of tetrachoric in place of product-moment
correlations produced more satisfactory solutions. Using product-moment
correlations and standard estimation procedures in such situations
resulted in downwardly biased estimates of the factor loadings of the
categorical variables, and also frequently resulted in a greater number of
factors.

The question immediately arises whether multiple regression 
estimates will also be affected by the choice of zero-order measures of
association for ordered categorical variables. Consider, for example, an 
equation with two independent variables, in which the dependent variable 
and the first independent variable are measured on continuous scales, 
while the second independent variable is a skewed dichotomous scale with
a 70/30 split. Furthermore, consider that the population correlations 
among all three variables are .50, .50, and .50. The resulting
standardized regression equation would be:

Y' = .33X1 + .33X2.

If product-moment correlations had been used to measure the zero-order
associations for all three variables, the resulting correlation coefficients
would have been .50, .38, and .38, respectively (Muthen, 1983a), and
the resulting standardized regression equation would be:

Y' = .42X1 + .22X2.

Using product-moment correlation coefficients would thus underestimate
the true influence of the dichotomous variable (X2) on the dependent
variable. Furthermore, the regression coefficient for the continuous
independent variable (X1) would overestimate its true effect, since the
estimated correlation between the two independent variables is less than
the true level of their association.
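
The two equations above follow directly from the correlations. As a minimal sketch, the standardized coefficients can be obtained by solving the normal equations in correlation form; the helper name is illustrative, and the latent case assumes that all three population correlations equal .50, which is the configuration that reproduces the .33 coefficients.

```python
import numpy as np


def standardized_betas(r_y1, r_y2, r_12):
    """Standardized coefficients for Y on X1 and X2 from three correlations."""
    r_xx = np.array([[1.0, r_12], [r_12, 1.0]])  # correlations among predictors
    r_xy = np.array([r_y1, r_y2])                # predictor-criterion correlations
    return np.linalg.solve(r_xx, r_xy)


latent = standardized_betas(0.50, 0.50, 0.50)    # true associations
observed = standardized_betas(0.50, 0.38, 0.38)  # product-moment, X2 split 70/30
print(np.round(latent, 2))    # [0.33 0.33]
print(np.round(observed, 2))  # [0.42 0.22]
```

Only the two correlations involving the dichotomy change, yet both coefficients move: the one for X2 falls and the one for X1 rises.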

While appropriate estimation procedures for measuring associations 
involving dichotomous (or other ordered polychotomous) independent
variables have been known for some time, the complexity of the 
calculations involved has prevented their use in practice. New 
developments, however, now make their calculation relatively easy 
(Joreskog and Sorbom, 1983; Muthen, 1982). Even so, such associations 
are still often measured with product-moment correlation coefficients. 
Such zero-order correlations, however, are known to be biased, and 
apparently yield biased partial regression estimates. Accordingly, the 
purpose of this paper is to examine the robustness of regression 
estimates when skewed dichotomous variables are included among the 
independent regressors.

METHODS 

There are, of course, a wide variety of models that could have 
been estimated in this study. Here we chose only one. The model used 
here contained four independent variables. Two of the variables 
(X1 and X2) may be considered, say, social background variables
measured on continuous scales, and only weakly or modestly related to
the dependent variable, which is often the case in reality. The other
two variables (X3 and X4) are more strongly related to the
dependent variable and to each other.

The assessment of the robustness of regression coefficients
estimated from product-moment correlation coefficients for dichotomous
variables was based on the analysis of fifty sets of simulated data. Each
of these sets contained 500 cases with data for five standard normal
variables. The averages of the product-moment correlations among these
continuous variables are shown in Table 1.



Insert Table 1 About Here 



Using the continuous variables, the regression equation was 
estimated with each of the fifty sets of data. For each independent 
variable, the fifty estimated regression coefficients were then averaged. 
The averages for the continuous-data regression coefficients represent 
the most accurate estimates of the influence of the four independent
variables on the dependent variable. Accordingly, these averages were
the standard against which subsequent estimates were compared in order
to determine the effect of categorization and correlation type on
regression estimates.

After estimating the regression equations with continuous data, two
independent variables (X3 and X4) were dichotomized in three
different patterns: both were split 50/50, both were split 80/20, and the
first one was split 80/20 while the second was split 20/80. The zero-order
associations among these variables, for each of the various methods
of dichotomizing the variables, were first estimated with product-moment
correlations and the regression equations estimated with SPSS-X (SPSS,
Inc., 1983). The zero-order associations were then re-estimated with a
mixed matrix of product-moment, tetrachoric, and polyserial correlations,
as appropriate, using LISREL-VI (Joreskog and Sorbom, 1983), and the
regression equations estimated with SPSS-X. (Note that generating the
data as standard normal variables guarantees that subsequent
dichotomization and the use of tetrachoric and polyserial correlations
meets the assumption that continuous distributions underlie the
dichotomized variables.)

To determine the robustness of the regression estimates, the fifty
sets of regression estimates were averaged for each of the six methods
used to estimate the regression equations containing the dichotomized
variables (i.e., 50/50 with product-moment and with mixed correlations;
80/20 with product-moment and with mixed correlations; and
80/20-20/80 with product-moment and with mixed correlations). Comparisons of
these averages to the average regression coefficients based on the
continuous data allow us to determine the impact of dichotomization on
the assessment of the influence of the independent variables on the
dependent variable. Those averages closely approximating their
corresponding regression coefficients based on the continuous data were
considered robust estimates. Accurate average estimates, however, are
of little value if they exhibit large variability. Accordingly, the
variances of the estimates were also computed, and are reported below.
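
The product-moment half of this design can be sketched in a few lines. The code below is an illustration under stated assumptions, not the original SPSS and LISREL runs: the population correlation matrix is assumed to equal the averages reported in Table 1, cuts are made at population rather than sample quantiles, and the mixed-matrix step is omitted. Function names and the seed are hypothetical.

```python
import numpy as np
from statistics import NormalDist

# Assumed population matrix: the Table 1 averages, ordered Y, X1, X2, X3, X4.
R = np.array([
    [1.000, 0.067, 0.204, 0.547, 0.479],
    [0.067, 1.000, 0.344, 0.128, 0.261],
    [0.204, 0.344, 1.000, 0.120, 0.323],
    [0.547, 0.128, 0.120, 1.000, 0.553],
    [0.479, 0.261, 0.323, 0.553, 1.000],
])


def betas_from_data(data):
    """Standardized regression of column 0 on columns 1-4 via the
    product-moment correlation matrix of the data."""
    r = np.corrcoef(data, rowvar=False)
    return np.linalg.solve(r[1:, 1:], r[1:, 0])


def run_study(split3, split4, n_sets=50, n_cases=500, seed=0):
    """Mean and SD of betas for continuous data and for dichotomized X3, X4.

    split3 and split4 give the share of cases coded 0 on X3 and X4.
    """
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(R)
    cut3 = NormalDist().inv_cdf(split3)
    cut4 = NormalDist().inv_cdf(split4)
    cont, dich = [], []
    for _ in range(n_sets):
        data = rng.standard_normal((n_cases, 5)) @ chol.T
        cont.append(betas_from_data(data))
        cut = data.copy()
        cut[:, 3] = data[:, 3] > cut3   # X3 becomes 0/1
        cut[:, 4] = data[:, 4] > cut4   # X4 becomes 0/1
        dich.append(betas_from_data(cut))
    cont, dich = np.array(cont), np.array(dich)
    return cont.mean(axis=0), dich.mean(axis=0), dich.std(axis=0, ddof=1)


cont_mean, pm_mean, pm_sd = run_study(0.80, 0.80)  # both split 80/20
print("continuous     :", np.round(cont_mean, 3))
print("product-moment :", np.round(pm_mean, 3))
print("SD (PM)        :", np.round(pm_sd, 3))
```

Calling `run_study(0.50, 0.50)` or `run_study(0.80, 0.20)` gives the other two split patterns. Re-estimating the dichotomized associations with tetrachoric and polyserial correlations before solving would complete the comparison.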

We hypothesized that using tetrachoric and polyserial correlations
for associations involving dichotomous variables would yield more accurate
estimates of the regression effects than would product-moment
correlations for the same variables. Furthermore, we hypothesized that
the greater the degree of skewness among the independent variables, the
greater would be the bias among regression estimates based on
product-moment correlations.

RESULTS

In the regression results reported below there are four
independent variables. Two of them (X1 and X2) were generated as
continuous standard normal variables, and were used as such in all of
the regression runs. Two others (X3 and X4) were generated as
continuous standard normal variables, but subsequently were split into
dichotomies as discussed above.

Table 2 shows the results of these regressions. The first row of
regression coefficients in Table 2 contains the average of the fifty 
estimations of the regression equation when all four variables were 
entered as continuous data. These coefficients, therefore, are the 
standards against which all other results should be compared. The next 
three rows of regression coefficients are the averages of the fifty 
estimations of the regression equation for the various splits; the input 
correlation matrix consisted entirely of product-moment correlations. The 
last three rows of regression coefficients in Table 2 are the averages for
the various splits; in these cases the input correlation matrix consisted
of a mixture of product-moment, tetrachoric, and polyserial correlations,
as appropriate. 



Insert Table 2 About Here 



Examination of the coefficients shown in Table 2 clearly indicates
that the use of tetrachoric and polyserial correlations for dichotomized
variables provided better average estimates of the effects of the
independent variables than did the use of product-moment correlations.
The average regression coefficient for X1 when all four independent
variables were continuous was -.085. For each degree of skewness, the
average regression coefficients using tetrachoric and polyserial
correlations were very close to -.085; none of the coefficients based on
product-moment correlations was as close. The explanation is not
complicated. Both Y, the dependent variable, and X1 were measured
as continuous variables; their average zero-order association, however,
was small (.067). But X1 was also correlated with the other
independent variables, and when these other variables were controlled
the partial effect of X1 became negative. With product-moment
correlations, however, the zero-order associations between X1 and the
two dichotomized variables (X3 and X4) were underestimated; thus,
the estimated influence of X1 on Y was attenuated. These
correlations, for both the product-moment and the mixed matrices, can
be found in Tables 3, 4, and 5, respectively, for the three different
splits.



Insert Tables 3, 4, and 5 About Here



The same conclusions can be seen to apply also to the regression
coefficients for X2. The use of tetrachoric and polyserial correlations
led to regression estimates much closer to the regression coefficient
based on continuous data than did the use of product-moment
correlations. Once again, the product-moment correlations underestimate
the associations between X2 and X3, and between X2 and X4;
thus, the partial effect of X2 on Y was overestimated.

In contrast, the use of product-moment correlations led to an
underestimation of the partial effect of X3 on Y. In this case, X3
was a dichotomized variable, and the use of a product-moment correlation
underestimated its zero-order association with the dependent variable.
Of course, the associations among X3 and the other independent
variables were also underestimated with product-moment correlations; but
correcting these underestimated associations (particularly with X1 and
X2) by using polyserial and tetrachoric correlations had only a modest
influence on the estimated partial effect of X3 on Y in comparison with
correcting the estimated association between X3 and Y itself.

Finally, examination of the regression coefficients for X4 shown
in Table 2 indicates no systematic bias in the use of product-moment
versus polyserial and tetrachoric correlations. In this case, the use of
product-moment correlations not only underestimated the association
between X4 and Y, but also underestimated the rather substantial
correlations between X4 and the other independent variables.
Consequently, using product-moment correlations to control for X1,
X2, and X3 in measuring the partial influence of X4 on Y removed
less of an effect than the statistical control should have, but removed it
from an attenuated estimate of the association between X4 and Y. In
contrast, using polyserial and tetrachoric correlations removed more of
an effect of X4 on Y, but removed it from a disattenuated estimate of
the zero-order association between X4 and Y.


In summary, the estimated partial regression coefficients for
continuous independent variables were more accurately estimated on the
average by using a mixed matrix of tetrachoric, polyserial, and
product-moment correlations when some of the variables were dichotomized
than by using product-moment correlations alone, which underestimated
zero-order associations involving categorical variables.

We also hypothesized that the greater the degree of skewness 
among the independent variables, the greater would be the bias among 
the estimates based on product-moment correlations. This hypothesis 
seems to be false. An examination of the average coefficients within 
correlation type in Table 2 shows that they differ only slightly. 
Furthermore, it appears that the average coefficients for degree of split 
could have varied by chance alone. To test this, a series of two-way 
analyses of variance were performed for each set of regression 
coefficients. While significant differences (with α = .01) were found
between correlation matrix types for X1, X2, and X3 (but not for
X4), no significant differences were found among the three degrees of
skewness for any of the four independent variables.

CONCLUSIONS 

While it has long been known that Pearson product-moment 
correlations underestimate zero-order associations involving ordered 
categorical variables, effects on multivariate partial estimates have only 
recently been investigated. Muthen (1983a) and Olsson (1979) concluded 
that the factor loadings in factor analyses involving dichotomized 
variables were better estimated with tetrachoric than with product- 
moment correlations. 

This paper investigated the effects on partial regression 
coefficients of using an input matrix of product-moment correlations 
versus a mixed matrix of tetrachoric, polyserial, and product-moment
correlations. We concluded that the average of fifty replications using
simulated data with a mixed matrix provided more accurate estimates of
the regression coefficients based on continuous data than did the
estimates generated from the input of a matrix of product-moment
correlations.

This does not mean, however, that the use of tetrachoric and
polyserial correlations is cost free. In the first place, it is possible for 
a mixed matrix of tetrachoric, polychoric, polyserial, and product-moment 
correlations not to be positive definite (Joreskog and Sorbom, 1983, 
p. IV. 6). Thus, there would be no regression solution. 
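
Whether an assembled matrix is usable can be checked before the regression is attempted. A minimal sketch, assuming NumPy and a symmetric input matrix; the helper name and both example matrices are hypothetical and are not drawn from the simulations reported here.

```python
import numpy as np


def is_positive_definite(r, tol=1e-10):
    """True when every eigenvalue of the symmetric matrix r exceeds tol."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(r, dtype=float))
    return bool(eigenvalues.min() > tol)


# A coherent set of correlations (eigenvalues 2.0, 0.5, 0.5).
consistent = [[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.5],
              [0.5, 0.5, 1.0]]

# Pairwise estimates that no single set of variables could produce
# (one eigenvalue is negative).
inconsistent = [[1.0, 0.9, 0.9],
                [0.9, 1.0, -0.9],
                [0.9, -0.9, 1.0]]

print(is_positive_definite(consistent))    # True
print(is_positive_definite(inconsistent))  # False
```

Because each tetrachoric or polyserial coefficient is estimated separately, nothing forces the pieces of a mixed matrix to be jointly coherent, which is how the second situation can arise.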

Even when the problem of positive definiteness does not develop
(it never occurred in the 350 replicated regressions reported here),
other costs develop. In particular, the variances of the estimates
computed from tetrachoric and polyserial correlations were greater
than those of the estimates generated from product-moment correlations.
This is shown in Table 6, where one can see that every standard
deviation for the fifty coefficients for product-moment correlations was
less than the corresponding standard deviation for the coefficients using
tetrachoric and polyserial correlations.



Insert Table 6 About Here 



Thus, one must balance the costs and benefits of using as input
into regression analyses product-moment versus tetrachoric and polyserial
correlations for ordered dichotomous variables. The conservative choice
is to use product-moment correlations, recognizing that the true effects
may be underestimated. The use of tetrachoric and polyserial
correlations on the average produces more accurate regression parameter
estimates, but the greater variability of the estimates can, for a single
case, result in estimates far from the true parameter. Knowing this, it
would be advisable to cross-validate one's results when using these
correlations.


Table 1. Average Product-Moment Correlations
among Five Standard Normal Continuous Variables

         Y      X1     X2     X3
Y
X1     .067
X2     .204   .344
X3     .547   .128   .120
X4     .479   .261   .323   .553


Table 2. Summary of Regression Results

                            Independent Variables
Correlation Type/
Split                    X1       X2       X3       X4       R²

Pearson/
  Continuous          -.0848    .1059    .4135    .2359    .3500

Pearson/
  50/50               -.0512    .1304    .3404    .2472    .2551
  80/20               -.0487    .1474    .3033    .2124    .2195
  80/20-20/80         -.0555    .1426    .3238    .2451    .2357

Tetrachoric etc./
  50/50               -.0834    .1015    .4092    .2473    .3532
  80/20               -.0832    .1045    .4112    .2450    .3554
  80/20-20/80         -.0859    .1078    .4138    .2405    .3540


Table 3. Average Correlations for 50/50 Split

         Y      X1     X2     X3     X4
Y        --   .067   .204   .434   .382
X1     .057     --   .344   .098   .210
X2     .204   .344     --   .101   .255
X3     .545   .123   .127     --   .356
X4     .430   .253   .320   .544     --

Product-moment matrix is above the diagonal
and the mixed matrix is below.


Table 4. Average Correlations for 80/20 Split

         Y      X1     X2     X3     X4
Y        --   .057   .204   .381   .339
X1     .057     --   .344   .088   .079
X2     .204   .344     --   .079   .234
X3     .544   .125   .113     --   .333
X4     .485   .255   .322   .550     --

Product-moment matrix is above the diagonal
and the mixed matrix is below.


Table 5. Average Correlations
for 80/20-20/80 Split

         Y      X1     X2     X3     X4
Y        --   .067   .204   .381   .334
X1     .057     --   .344   .088   .183
X2     .204   .344     --   .079   .223
X3     .544   .125   .113     --   .207
X4     .477   .255   .322   .545     --

Product-moment matrix is above the diagonal
and the mixed matrix is below.


Table 6. Standard Deviations for Average Regression
Coefficients Shown in Table 2

                            Independent Variables
Correlation Type/
Split                    X1       X2       X3       X4       R²

Pearson/
  Continuous           .0403    .0353    .0372    .0420    .0320

Pearson/
  50/50                .0441    .0385    .0325    .0384    .0325
  80/20                .0452    .0405    .0297    .0374    .0295
  80/20-20/80          .0415    .0382    .0353    .0345    .0292

Tetrachoric etc./
  50/50                .0472    .0427    .0538    .0613    .0435
  80/20                .0508    .0484    .0557    .0708    .0457
  80/20-20/80          .0421    .0430    .0725    .0754    .0454